Load the functions

source("thesis_functions.R")
dhist.plot <- function(d, v, vname){
  # histogram of column v (passed unquoted) with a normal curve overlaid,
  # using the sample mean and sd, for a quick visual normality check
  ggplot(d, aes(x = {{ v }})) + 
    geom_histogram(aes(y = after_stat(density)), bins = 30, fill = "white", col = "black") +
    stat_function(
      fun = dnorm, 
      args = list(mean = mean(dplyr::pull(d, {{ v }})), sd = sd(dplyr::pull(d, {{ v }}))), 
      lwd = 1, 
      col = 'red'
    ) +
    xlab(vname) + ylab("Density") +
    ggtitle(paste(vname, "Distribution")) +
    theme_bw()
}

Set the levels of the ordinal data, and convert the other variables to their appropriate datatypes. The remaining date/time columns were not adjusted since they will not be used in this portion.

data <- fread("kpopdata.csv")
data <- mutate(data, ArtistType = as.factor(ArtistType),
               ArtistGender = as.factor(ArtistGender),
               ArtistGen = factor(ArtistGen),
               release_date = as.POSIXct(release_date, format = "%m/%d/%y"),
               key = factor(key, levels = 0:11),
               mode = as.factor(mode),
               time_signature = factor(time_signature, levels = c(1,3,4,5)))

#make month/year into a continuous variable: the number of months since January 1992
#(e.g. Jan 1992 -> 1, Dec 2021 -> 360)
ref.year = 1992 #January 1992 is the reference
data$months = 12*(year(data$release_date) - ref.year) + month(data$release_date)
#View(select(data, Artist, song_name, release_date, months))
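As a quick sanity check on this indexing (a sketch using data.table's year() and month() helpers, which the fread import makes available), January 1992 should map to month 1 and December 2021 to month 360:

```r
library(data.table)  # provides year() and month()

# months-since-reference index, matching the computation above
month.index <- function(d, ref.year = 1992) {
  12 * (year(d) - ref.year) + month(d)
}

month.index(as.POSIXct("1992-01-15"))  # 1
month.index(as.POSIXct("2021-12-15"))  # 360
```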

Testing how to transform back from a log function ... this works :D

data$months[1:10]
##  [1] 243 243 243 243 243 243 243 240 345 309
log.trans <- log(360-data$months);log.trans[1:10]
##  [1] 4.762174 4.762174 4.762174 4.762174 4.762174 4.762174 4.762174 4.787492
##  [9] 2.708050 3.931826
back.trans <- 360 - exp(log.trans);back.trans[1:10]
##  [1] 243 243 243 243 243 243 243 240 345 309

Assessing normality of the Response Variable: Month of release date.

original months value

Summary of the data to gain an understanding of the range.

summary(data$months)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##     3.0   241.0   285.0   266.3   317.0   349.0

check out distribution of the months

#dhist.plot(data, months, "Month Released")
ggplot(data, aes(x = months)) +
  geom_histogram(aes(y = after_stat(density)), bins = 30, fill = "white", col = "black") +
  stat_function(
    fun = dnorm, 
    args = list(mean = mean(data$months), sd = sd(data$months)), 
    lwd = 1, 
    col = 'red'
  ) +
  xlab("Month Released") + ylab("Density") +
  ggtitle(paste("Month Released", "Distribution")) +
  theme_bw()

#produce normal qqplot to assess normality
qqnorm(data$months, pch = 1, frame = FALSE)
qqline(data$months, col = "red", lwd = 2)

#Skewness score
skewness(data$months)
## [1] -1.342445

As we can see, the distribution of months is severely skewed to the left, so we will try to find a transformation that normalizes it.

log transform for moderately negatively skewed data

A common transformation towards normality for moderately skewed data is \(log(max(y) + 1 - y)\); since the max number of months is 349, the transformation is \(log(350 - y)\). Note that the most recent release (\(y = 349\)) maps to \(log(1) = 0\).

summary of the data under this transformation

summary(log(350-data$months))
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   0.000   3.497   4.174   4.029   4.691   5.849

Already from the five-number summary it is evident that skewness remains, since the median is much closer to the max than to the min.

ggplot(data, aes(x = log(350 - months))) +
  geom_histogram(aes(y = after_stat(density)), bins = 30, fill = "white", col = "black") +
  stat_function(
    fun = dnorm, 
    args = list(mean = mean(log(350-data$months)), sd = sd(log(350-data$months))), 
    lwd = 1, 
    col = 'red'
  ) +
  xlab("Month Released : log(350 - y))") + ylab("Density") +
  ggtitle(paste("Month Released : log(350 - y))", "Distribution")) +
  theme_bw()

#produce normal qqplot to assess normality
qqnorm(log(350-data$months), pch = 1, frame = FALSE)
qqline(log(350-data$months), col = "red", lwd = 2)

#Skewness score
skewness(log(350-data$months))
## [1] -0.8029814

The distribution looks more normal, but still has noticeable skewness. The transformation has therefore not been successful in achieving normality, as the observations deviate too much from the normal QQ line.

Since issues of skewness remain, we tweak the standard negative-skew transformation to this alternative: \(log(360 - y)\). The larger offset keeps the smallest argument of the log at 11 rather than 1, which reins in the extreme left tail.

summary of the data under this transformation

summary(log(360-data$months))
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   2.398   3.761   4.317   4.269   4.779   5.878

Notice that the maximum is still very similar to before; however, the minimum has increased with this change, and the median now sits at a more central location than before.

ggplot(data, aes(x = log(360 - months))) +
  geom_histogram(aes(y = after_stat(density)), bins = 30, fill = "white", col = "black") +
  stat_function(
    fun = dnorm, 
    args = list(mean = mean(log(360-data$months)), sd = sd(log(360-data$months))), 
    lwd = 1, 
    col = 'red'
  ) +
  xlab("Month Released : log(360 - y))") + ylab("Density") +
  ggtitle(paste("Month Released : log(360 - y))", "Distribution")) +
  theme_bw()

#produce normal qqplot to assess normality
qqnorm(log(360-data$months), pch = 1, frame = FALSE)
qqline(log(360-data$months), col = "red", lwd = 2)

#calculate skewness
skewness(log(360-data$months))
## [1] -0.2249134

The distribution of months has greatly improved and looks roughly normal. Furthermore, the normal QQ plot shows great improvement.

square root transform for moderately negatively skewed data

Another common transformation for moderately skewed data takes a square root approach: \(\sqrt{max(y) + 1 - y}\) = \(\sqrt{350 - y}\)

summary of the data under this transformation

summary(sqrt(350-data$months))
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   1.000   5.745   8.062   8.382  10.440  18.628

ggplot(data, aes(x = sqrt(350 - months))) +
  geom_histogram(aes(y = after_stat(density)), bins = 30, fill = "white", col = "black") +
  stat_function(
    fun = dnorm, 
    args = list(mean = mean(sqrt(350-data$months)), sd = sd(sqrt(350-data$months))), 
    lwd = 1, 
    col = 'red'
  ) +
  xlab("Month Released : sqrt(350 - y))") + ylab("Density") +
  ggtitle(paste("Month Released : sqrt(350 - y))", "Distribution")) +
  theme_bw()

#produce normal qqplot to assess normality
qqnorm(sqrt(350-data$months), pch = 1, frame = FALSE)
qqline(sqrt(350-data$months), col = "red", lwd = 2)

#calculate skewness
skewness(sqrt(350-data$months))
## [1] 0.433938

It is still somewhat positively skewed, but pretty good!

Inverse transform for severely negatively skewed data

Lastly, a common transformation for severely negatively skewed data takes an inverse approach: \(\frac{1}{max(y) + 1 - y}\) = \(\frac{1}{350 - y}\)

ggplot(data, aes(x = 1/(350 - months))) +
  geom_histogram(aes(y = after_stat(density)), bins = 30, fill = "white", col = "black") +
  stat_function(
    fun = dnorm, 
    args = list(mean = mean(1/(350-data$months)), sd = sd(1/(350-data$months))), 
    lwd = 1, 
    col = 'red'
  ) +
  xlab("Month Released : 1/(350 - y))") + ylab("Density") +
  ggtitle(paste("Month Released : 1/(350 - y))", "Distribution")) +
  theme_bw()

This is a terrible idea: the resulting distribution is nowhere near normal, so the inverse transform is discarded.

Splitting into training and test data

Create training (75%) and test (25%) data to fit the regression models on.

kpop <- dplyr::select(data, months, popularity, duration, acousticness, danceability, energy, instrumentalness, key, loudness, mode, speechiness, tempo, valence)
kpop0 <- kpop %>% dplyr::filter(mode == 0) %>% dplyr::select(-mode)
kpop1 <- kpop %>% dplyr::filter(mode == 1) %>% dplyr::select(-mode)

### Kpop mode 0 train and test
smpl.size0 <- floor(0.75*nrow(kpop0))
set.seed(123)
smpl0 <- sample(nrow(kpop0), smpl.size0, replace = FALSE)
train0 <- kpop0[smpl0,]
test0 <- kpop0[-smpl0,]

### Kpop mode 1 train and test
smpl.size1 <- floor(0.75*nrow(kpop1))
set.seed(123)
smpl1 <- sample(nrow(kpop1), smpl.size1, replace = FALSE)
train1 <- kpop1[smpl1,]
test1 <- kpop1[-smpl1,]
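An alternative worth noting: caret's createDataPartition() performs the same 75/25 split while stratifying on the outcome, which keeps the distribution of months similar across train and test (a sketch; kpop0 is the mode-0 data from above):

```r
library(caret)

set.seed(123)
# stratified 75% split on the response, in place of the simple random sample()
idx0 <- as.vector(createDataPartition(kpop0$months, p = 0.75, list = FALSE))
train0.strat <- kpop0[idx0, ]
test0.strat  <- kpop0[-idx0, ]
```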

Assessing performance of the transformations on multiple linear regression for mode 0 songs

ml0 <- lm(months ~. , data = train0)
summary(ml0)
## 
## Call:
## lm(formula = months ~ ., data = train0)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -220.905  -29.308    6.643   37.204  199.560 
## 
## Coefficients:
##                   Estimate Std. Error t value Pr(>|t|)    
## (Intercept)      506.91441   16.88344  30.024  < 2e-16 ***
## popularity         1.60481    0.05435  29.528  < 2e-16 ***
## duration         -20.26179    1.81712 -11.151  < 2e-16 ***
## acousticness      31.41991    6.58842   4.769 1.93e-06 ***
## danceability     -61.87134   10.53878  -5.871 4.74e-09 ***
## energy           -98.67824   10.85541  -9.090  < 2e-16 ***
## instrumentalness  69.35253   14.06277   4.932 8.54e-07 ***
## key1              -3.38890    4.81002  -0.705 0.481137    
## key2              -8.25574    5.95496  -1.386 0.165724    
## key3              -4.68988    6.66026  -0.704 0.481381    
## key4             -16.64198    4.72042  -3.526 0.000428 ***
## key5              -6.37892    4.57708  -1.394 0.163508    
## key6              -7.69000    4.77927  -1.609 0.107700    
## key7              -3.91007    5.23715  -0.747 0.455354    
## key8              -1.77216    5.65747  -0.313 0.754115    
## key9             -16.31280    4.85082  -3.363 0.000780 ***
## key10            -13.05634    4.99295  -2.615 0.008962 ** 
## key11            -13.43026    4.45880  -3.012 0.002613 ** 
## loudness          14.12750    0.67833  20.827  < 2e-16 ***
## speechiness       39.57966   15.28264   2.590 0.009642 ** 
## tempo             -0.12300    0.04024  -3.057 0.002255 ** 
## valence          -25.15503    5.77332  -4.357 1.36e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 55.61 on 3482 degrees of freedom
## Multiple R-squared:  0.4143, Adjusted R-squared:  0.4108 
## F-statistic: 117.3 on 21 and 3482 DF,  p-value: < 2.2e-16

check assumptions with diagnostic plots

#par(mfrow = c(4,1))
plot(ml0)

*Scale-Location: Due to the clear decreasing line, rather than a flat horizontal line of equally spread points, there is clear evidence that homogeneity of the residual variance is violated. We may need to run the model on a transformation of the outcome variable, the month of song release.

qqnorm(ml0$residuals)
qqline(ml0$residuals, col = "red")

checking for multicollinearity

car::vif(ml0)
##                      GVIF Df GVIF^(1/(2*Df))
## popularity       1.166795  1        1.080183
## duration         1.095607  1        1.046713
## acousticness     1.572984  1        1.254187
## danceability     1.463926  1        1.209928
## energy           2.625882  1        1.620457
## instrumentalness 1.062607  1        1.030828
## key              1.085836 11        1.003750
## loudness         1.974944  1        1.405327
## speechiness      1.078396  1        1.038458
## tempo            1.112227  1        1.054622
## valence          1.540583  1        1.241202

Overall, the current model does not meet the assumptions of the linear model. We need to transform the response variable to achieve normality (and constant variance) of the residuals.

yhat.ml0 = predict(ml0, newdata = test0)
data.frame(
  RMSE = RMSE(yhat.ml0, test0$months),
  R2 = R2(yhat.ml0, test0$months)
)
##       RMSE        R2
## 1 56.76951 0.3780801

MLR: Box-Cox Transformation

boxcox(ml0,lambda = seq(1.7, 3, 0.1), plotit = TRUE)

The optimal transformation is at \(\lambda = 2.3\).
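bc.transform() is defined in thesis_functions.R; judging by the summary of the transformed months, it is presumably the standard Box-Cox transform, along these lines:

```r
# Presumed sketch of bc.transform (standard Box-Cox form):
# (y^lambda - 1) / lambda for lambda != 0, and log(y) at lambda = 0
bc.transform <- function(y, lambda) {
  if (lambda == 0) log(y) else (y^lambda - 1) / lambda
}

bc.transform(349, 2.3)  # roughly 3.07e5, matching the Max of the transformed months
```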

summary(bc.transform(data$months, 2.3))
##      Min.   1st Qu.    Median      Mean   3rd Qu.      Max. 
##      5.01 130891.04 192492.74 180682.07 245871.22 306739.73
hist(bc.transform(data$months, 2.3))

ml0.bc <- lm(bc.transform(months, 2.3) ~. , data = train0)
summary(ml0.bc)
## 
## Call:
## lm(formula = bc.transform(months, 2.3) ~ ., data = train0)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -182515  -45410    3662   45295  209577 
## 
## Coefficients:
##                   Estimate Std. Error t value Pr(>|t|)    
## (Intercept)      387391.90   19370.62  19.999  < 2e-16 ***
## popularity         2137.24      62.36  34.275  < 2e-16 ***
## duration         -22523.94    2084.80 -10.804  < 2e-16 ***
## acousticness      33787.57    7558.99   4.470 8.08e-06 ***
## danceability     -61484.25   12091.30  -5.085 3.87e-07 ***
## energy           -72490.65   12454.57  -5.820 6.40e-09 ***
## instrumentalness  79343.44   16134.42   4.918 9.16e-07 ***
## key1              -1387.89    5518.61  -0.251  0.80145    
## key2              -5870.47    6832.21  -0.859  0.39027    
## key3              -6175.19    7641.41  -0.808  0.41908    
## key4             -15890.01    5415.81  -2.934  0.00337 ** 
## key5              -4046.15    5251.36  -0.770  0.44106    
## key6              -6788.43    5483.33  -1.238  0.21580    
## key7                670.53    6008.66   0.112  0.91115    
## key8                661.37    6490.90   0.102  0.91885    
## key9             -15530.42    5565.41  -2.791  0.00529 ** 
## key10            -15579.34    5728.48  -2.720  0.00657 ** 
## key11            -12658.78    5115.65  -2.475  0.01339 *  
## loudness          11349.58     778.26  14.583  < 2e-16 ***
## speechiness       37433.06   17534.00   2.135  0.03284 *  
## tempo              -128.82      46.17  -2.790  0.00530 ** 
## valence          -33700.33    6623.82  -5.088 3.81e-07 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 63800 on 3482 degrees of freedom
## Multiple R-squared:  0.4159, Adjusted R-squared:  0.4124 
## F-statistic: 118.1 on 21 and 3482 DF,  p-value: < 2.2e-16

check diagnostic plots

plot(ml0.bc)

Normality has improved, but nothing else has.

yhat.ml0.bc = predict(ml0.bc, newdata = test0 %>% mutate(months = bc.transform(test0$months, 2.3)))
data.frame(
  RMSE = RMSE(yhat.ml0.bc, bc.transform(test0$months, 2.3)),
  R2 = R2(yhat.ml0.bc, bc.transform(test0$months, 2.3))
)
##       RMSE        R2
## 1 64627.64 0.3899321

MLR: log(360 - y) Transformation

hist(log(360-train0$months))

summary(log(360-train0$months))
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   2.398   3.732   4.304   4.261   4.796   5.878
ml0.log360 <- lm(log(360-train0$months) ~. , data = train0)
summary(ml0.log360)
## 
## Call:
## lm(formula = log(360 - train0$months) ~ ., data = train0)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -2.39175 -0.40146  0.04915  0.44693  1.61319 
## 
## Coefficients:
##                    Estimate Std. Error t value Pr(>|t|)    
## (Intercept)       2.7389301  0.1848139  14.820  < 2e-16 ***
## popularity       -0.0215638  0.0005949 -36.246  < 2e-16 ***
## duration          0.1940970  0.0198910   9.758  < 2e-16 ***
## acousticness     -0.2401500  0.0721199  -3.330 0.000878 ***
## danceability      0.4237910  0.1153624   3.674 0.000243 ***
## energy            0.5720095  0.1188284   4.814 1.54e-06 ***
## instrumentalness -0.8481059  0.1539376  -5.509 3.86e-08 ***
## key1             -0.0127793  0.0526528  -0.243 0.808245    
## key2              0.0157264  0.0651857   0.241 0.809372    
## key3              0.0354738  0.0729063   0.487 0.626596    
## key4              0.1181835  0.0516720   2.287 0.022245 *  
## key5              0.0170423  0.0501029   0.340 0.733768    
## key6              0.0505054  0.0523161   0.965 0.334417    
## key7             -0.0435365  0.0573283  -0.759 0.447650    
## key8             -0.0555081  0.0619293  -0.896 0.370147    
## key9              0.1087090  0.0530993   2.047 0.040706 *  
## key10             0.1264746  0.0546551   2.314 0.020723 *  
## key11             0.1012854  0.0488081   2.075 0.038044 *  
## loudness         -0.0871796  0.0074253 -11.741  < 2e-16 ***
## speechiness      -0.3296733  0.1672909  -1.971 0.048842 *  
## tempo             0.0009236  0.0004405   2.097 0.036072 *  
## valence           0.3243115  0.0631975   5.132 3.03e-07 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.6087 on 3482 degrees of freedom
## Multiple R-squared:  0.4105, Adjusted R-squared:  0.4069 
## F-statistic: 115.4 on 21 and 3482 DF,  p-value: < 2.2e-16

check diagnostic plots

plot(ml0.log360)

Predictions and performance diagnostics

yhat.ml0.log360 = predict(ml0.log360, newdata = test0 %>% mutate(months = log(360 - test0$months)))
data.frame(
  RMSE = RMSE(yhat.ml0.log360, log(360 - test0$months)),
  R2 = R2(yhat.ml0.log360, log(360 - test0$months))
)
##        RMSE        R2
## 1 0.5976752 0.3932124

MLR: sqrt(350 - y) Transformation

hist(sqrt(350-train0$months))

summary(sqrt(350-train0$months))
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   1.000   5.635   8.000   8.377  10.536  18.628
ml0.sqrt350 <- lm(sqrt(350-train0$months) ~. , data = train0)
summary(ml0.sqrt350)
## 
## Call:
## lm(formula = sqrt(350 - train0$months) ~ ., data = train0)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -10.4296  -2.0157  -0.0706   2.0165   9.4299 
## 
## Coefficients:
##                   Estimate Std. Error t value Pr(>|t|)    
## (Intercept)      -1.352232   0.877636  -1.541  0.12346    
## popularity       -0.097745   0.002825 -34.598  < 2e-16 ***
## duration          1.011761   0.094457  10.711  < 2e-16 ***
## acousticness     -1.411834   0.342480  -4.122 3.84e-05 ***
## danceability      2.630235   0.547828   4.801 1.64e-06 ***
## energy            3.783006   0.564287   6.704 2.36e-11 ***
## instrumentalness -3.941710   0.731012  -5.392 7.43e-08 ***
## key1              0.040263   0.250035   0.161  0.87208    
## key2              0.235396   0.309551   0.760  0.44704    
## key3              0.212147   0.346214   0.613  0.54007    
## key4              0.712502   0.245377   2.904  0.00371 ** 
## key5              0.187106   0.237926   0.786  0.43169    
## key6              0.315464   0.248436   1.270  0.20424    
## key7             -0.036510   0.272238  -0.134  0.89332    
## key8             -0.112283   0.294087  -0.382  0.70263    
## key9              0.678401   0.252156   2.690  0.00717 ** 
## key10             0.660914   0.259544   2.546  0.01093 *  
## key11             0.589000   0.231778   2.541  0.01109 *  
## loudness         -0.560716   0.035261 -15.902  < 2e-16 ***
## speechiness      -1.819528   0.794423  -2.290  0.02206 *  
## tempo             0.005484   0.002092   2.622  0.00879 ** 
## valence           1.494620   0.300109   4.980 6.66e-07 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.89 on 3482 degrees of freedom
## Multiple R-squared:  0.4249, Adjusted R-squared:  0.4214 
## F-statistic: 122.5 on 21 and 3482 DF,  p-value: < 2.2e-16

check diagnostic plots

plot(ml0.sqrt350)

Predictions and performance diagnostics

yhat.ml0.sqrt350 = predict(ml0.sqrt350, newdata = test0 %>% dplyr::mutate(months = sqrt(350-test0$months)))
data.frame(
  RMSE = RMSE(yhat.ml0.sqrt350, sqrt(350-test0$months)),
  R2 = R2(yhat.ml0.sqrt350, sqrt(350-test0$months))
)
##       RMSE        R2
## 1 2.894936 0.3988485

Applying the log(360 - y) transform to all data

All three transformations yield similar test R2 (their RMSEs are on different scales and not directly comparable), and log(360 - y) produced the best normality diagnostics (skewness of -0.22), so it is applied to all of the training and test sets.

train0$months <- log(360 - train0$months)
test0$months <- log(360 - test0$months)

train1$months <- log(360 - train1$months)
test1$months <- log(360 - test1$months)

Multiple linear regression for mode 0 songs

Full Multiple Linear Regression

mlr0 <- lm(months ~. , data = train0)

assess model assumptions

plot(mlr0)

view model

summary(mlr0)
## 
## Call:
## lm(formula = months ~ ., data = train0)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -2.39175 -0.40146  0.04915  0.44693  1.61319 
## 
## Coefficients:
##                    Estimate Std. Error t value Pr(>|t|)    
## (Intercept)       2.7389301  0.1848139  14.820  < 2e-16 ***
## popularity       -0.0215638  0.0005949 -36.246  < 2e-16 ***
## duration          0.1940970  0.0198910   9.758  < 2e-16 ***
## acousticness     -0.2401500  0.0721199  -3.330 0.000878 ***
## danceability      0.4237910  0.1153624   3.674 0.000243 ***
## energy            0.5720095  0.1188284   4.814 1.54e-06 ***
## instrumentalness -0.8481059  0.1539376  -5.509 3.86e-08 ***
## key1             -0.0127793  0.0526528  -0.243 0.808245    
## key2              0.0157264  0.0651857   0.241 0.809372    
## key3              0.0354738  0.0729063   0.487 0.626596    
## key4              0.1181835  0.0516720   2.287 0.022245 *  
## key5              0.0170423  0.0501029   0.340 0.733768    
## key6              0.0505054  0.0523161   0.965 0.334417    
## key7             -0.0435365  0.0573283  -0.759 0.447650    
## key8             -0.0555081  0.0619293  -0.896 0.370147    
## key9              0.1087090  0.0530993   2.047 0.040706 *  
## key10             0.1264746  0.0546551   2.314 0.020723 *  
## key11             0.1012854  0.0488081   2.075 0.038044 *  
## loudness         -0.0871796  0.0074253 -11.741  < 2e-16 ***
## speechiness      -0.3296733  0.1672909  -1.971 0.048842 *  
## tempo             0.0009236  0.0004405   2.097 0.036072 *  
## valence           0.3243115  0.0631975   5.132 3.03e-07 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.6087 on 3482 degrees of freedom
## Multiple R-squared:  0.4105, Adjusted R-squared:  0.4069 
## F-statistic: 115.4 on 21 and 3482 DF,  p-value: < 2.2e-16

All variables except Key1, Key2, Key3, Key5, Key6, Key7, and Key8 are significant to the model in predicting the month of song release.

The Full Multiple Linear Regression model for mode 0 songs is written below with coefficients rounded to the thousandths. \[ log(360 - \hat{months}) = 2.739 - 0.022{popularity} + 0.194{duration} - 0.240{acousticness} \\ + 0.424{danceability} + 0.572{energy} - 0.848{instrumentalness} \\ - 0.013{key1} + 0.016{key2} + 0.035{key3} + 0.118{key4} + 0.017{key5} + 0.051{key6} \\ - 0.044{key7} - 0.056{key8} + 0.109{key9} + 0.126{key10} + 0.101{key11} \\ - 0.087{loudness} - 0.330{speechiness} + 0.001{tempo} + 0.324{valence} \]

# prediction on test data
yhat.mlr0 = predict(mlr0, newdata=test0)
# RMSE for test data
error.mlr0 <- yhat.mlr0 - test0$months
rmse.mlr0 <- sqrt(mean(error.mlr0^2))
data.frame(
  RMSE = RMSE(yhat.mlr0, test0$months),
  R2 = R2(yhat.mlr0, test0$months)
)
##        RMSE        R2
## 1 0.5976752 0.3932124
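This RMSE is on the log(360 - y) scale, so it is hard to interpret directly. The predictions can be mapped back to months with the inverse transform tested earlier (a sketch using the yhat.mlr0 and test0 objects from above):

```r
# invert z = log(360 - y)  =>  y = 360 - exp(z)
to.months <- function(z) 360 - exp(z)

months.hat <- to.months(yhat.mlr0)       # predicted months on the original scale
months.obs <- to.months(test0$months)    # test0$months was log-transformed above
sqrt(mean((months.hat - months.obs)^2))  # RMSE in months
```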

Variable Selection: Stepwise 10-fold Cross Validation

set.seed(123)
train_control <- trainControl(method = "cv", 
                              number = 10) 
# capture.output() swallows the long stepAIC trace that train() prints while fitting
bye <- capture.output(mlr.step_kcv0 <- train(months ~. , data = train0,  
                 method = "lmStepAIC", trControl = train_control)) 
print(mlr.step_kcv0)
## Linear Regression with Stepwise Selection 
## 
## 3504 samples
##   11 predictor
## 
## No pre-processing
## Resampling: Cross-Validated (10 fold) 
## Summary of sample sizes: 3153, 3153, 3153, 3154, 3155, 3153, ... 
## Resampling results:
## 
##   RMSE       Rsquared   MAE      
##   0.6115382  0.4023572  0.4944596
mlr.step_kcv0$finalModel
## 
## Call:
## lm(formula = .outcome ~ popularity + duration + acousticness + 
##     danceability + energy + instrumentalness + key4 + key6 + 
##     key9 + key10 + key11 + loudness + speechiness + tempo + valence, 
##     data = dat)
## 
## Coefficients:
##      (Intercept)        popularity          duration      acousticness  
##        2.7327797        -0.0215505         0.1933518        -0.2409479  
##     danceability            energy  instrumentalness              key4  
##        0.4254183         0.5722443        -0.8523074         0.1245219  
##             key6              key9             key10             key11  
##        0.0567355         0.1150826         0.1327266         0.1074954  
##         loudness       speechiness             tempo           valence  
##       -0.0870517        -0.3278520         0.0009343         0.3244776

The Stepwise Multiple Linear Regression model for mode 0 songs is written below with coefficients rounded to the thousandths. \[ log(360 - \hat{months}) = 2.732 - 0.022{popularity} + 0.193{duration} - 0.241{acousticness} \\ + 0.425{danceability} + 0.572{energy} - 0.852{instrumentalness} \\ + 0.125{key4} + 0.057{key6} \\ + 0.115{key9} + 0.133{key10} + 0.107{key11} \\ - 0.087{loudness} - 0.328{speechiness} + 0.001{tempo} + 0.324{valence} \]

Compared to the full multiple linear regression model, the stepwise regression model omits the variables: Key1, Key2, Key3, Key5, Key7, and Key8.
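The prediction below relies on key.dummy() from thesis_functions.R. Its definition is not shown here, but it presumably expands the 12-level key factor into the 0/1 indicator columns (key1 through key11) that the stepwise final model's terms reference; a hypothetical sketch:

```r
# Hypothetical sketch of key.dummy(): add key1..key11 indicator columns
# (key 0 is the baseline level, so it gets no column)
key.dummy <- function(df) {
  for (k in 1:11) df[[paste0("key", k)]] <- as.numeric(df$key == k)
  df
}
```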

prediction on test data

# prediction on test data
yhat.step_kcv0 = predict(mlr.step_kcv0$finalModel, newdata=key.dummy(test0))
# RMSE for test data
error.step_kcv0 <- yhat.step_kcv0 - test0$months
rmse.step_kcv0 <- sqrt(mean(error.step_kcv0^2))
data.frame(
  RMSE = RMSE(yhat.step_kcv0, test0$months),
  R2 = R2(yhat.step_kcv0, test0$months)
)
##        RMSE        R2
## 1 0.5980156 0.3925419

Variable Selection: All Subsets Regression (not using k-fold CV)

allreg.models0 <- regsubsets(months ~., data = train0, nvmax = 21)
summary(allreg.models0)
## Subset selection object
## Call: regsubsets.formula(months ~ ., data = train0, nvmax = 21)
## 21 Variables  (and intercept)
##                  Forced in Forced out
## popularity           FALSE      FALSE
## duration             FALSE      FALSE
## acousticness         FALSE      FALSE
## danceability         FALSE      FALSE
## energy               FALSE      FALSE
## instrumentalness     FALSE      FALSE
## key1                 FALSE      FALSE
## key2                 FALSE      FALSE
## key3                 FALSE      FALSE
## key4                 FALSE      FALSE
## key5                 FALSE      FALSE
## key6                 FALSE      FALSE
## key7                 FALSE      FALSE
## key8                 FALSE      FALSE
## key9                 FALSE      FALSE
## key10                FALSE      FALSE
## key11                FALSE      FALSE
## loudness             FALSE      FALSE
## speechiness          FALSE      FALSE
## tempo                FALSE      FALSE
## valence              FALSE      FALSE
## 1 subsets of each size up to 21
## Selection Algorithm: exhaustive
##           popularity duration acousticness danceability energy instrumentalness
## 1  ( 1 )  "*"        " "      " "          " "          " "    " "             
## 2  ( 1 )  "*"        "*"      " "          " "          " "    " "             
## 3  ( 1 )  "*"        "*"      " "          "*"          " "    " "             
## 4  ( 1 )  "*"        "*"      " "          " "          " "    " "             
## 5  ( 1 )  "*"        "*"      " "          " "          "*"    " "             
## 6  ( 1 )  "*"        "*"      " "          " "          "*"    "*"             
## 7  ( 1 )  "*"        "*"      " "          "*"          "*"    "*"             
## 8  ( 1 )  "*"        "*"      "*"          "*"          "*"    "*"             
## 9  ( 1 )  "*"        "*"      "*"          "*"          "*"    "*"             
## 10  ( 1 ) "*"        "*"      "*"          "*"          "*"    "*"             
## 11  ( 1 ) "*"        "*"      "*"          "*"          "*"    "*"             
## 12  ( 1 ) "*"        "*"      "*"          "*"          "*"    "*"             
## 13  ( 1 ) "*"        "*"      "*"          "*"          "*"    "*"             
## 14  ( 1 ) "*"        "*"      "*"          "*"          "*"    "*"             
## 15  ( 1 ) "*"        "*"      "*"          "*"          "*"    "*"             
## 16  ( 1 ) "*"        "*"      "*"          "*"          "*"    "*"             
## 17  ( 1 ) "*"        "*"      "*"          "*"          "*"    "*"             
## 18  ( 1 ) "*"        "*"      "*"          "*"          "*"    "*"             
## 19  ( 1 ) "*"        "*"      "*"          "*"          "*"    "*"             
## 20  ( 1 ) "*"        "*"      "*"          "*"          "*"    "*"             
## 21  ( 1 ) "*"        "*"      "*"          "*"          "*"    "*"             
##           key1 key2 key3 key4 key5 key6 key7 key8 key9 key10 key11 loudness
## 1  ( 1 )  " "  " "  " "  " "  " "  " "  " "  " "  " "  " "   " "   " "     
## 2  ( 1 )  " "  " "  " "  " "  " "  " "  " "  " "  " "  " "   " "   " "     
## 3  ( 1 )  " "  " "  " "  " "  " "  " "  " "  " "  " "  " "   " "   " "     
## 4  ( 1 )  " "  " "  " "  " "  " "  " "  " "  " "  " "  " "   " "   "*"     
## 5  ( 1 )  " "  " "  " "  " "  " "  " "  " "  " "  " "  " "   " "   "*"     
## 6  ( 1 )  " "  " "  " "  " "  " "  " "  " "  " "  " "  " "   " "   "*"     
## 7  ( 1 )  " "  " "  " "  " "  " "  " "  " "  " "  " "  " "   " "   "*"     
## 8  ( 1 )  " "  " "  " "  " "  " "  " "  " "  " "  " "  " "   " "   "*"     
## 9  ( 1 )  " "  " "  " "  " "  " "  " "  " "  "*"  " "  " "   " "   "*"     
## 10  ( 1 ) " "  " "  " "  " "  " "  " "  "*"  "*"  " "  " "   " "   "*"     
## 11  ( 1 ) " "  " "  " "  "*"  " "  " "  " "  " "  " "  "*"   "*"   "*"     
## 12  ( 1 ) " "  " "  " "  "*"  " "  " "  " "  " "  "*"  "*"   "*"   "*"     
## 13  ( 1 ) " "  " "  " "  "*"  " "  " "  " "  " "  "*"  "*"   "*"   "*"     
## 14  ( 1 ) " "  " "  " "  "*"  " "  " "  " "  " "  "*"  "*"   "*"   "*"     
## 15  ( 1 ) " "  " "  " "  "*"  " "  "*"  " "  " "  "*"  "*"   "*"   "*"     
## 16  ( 1 ) " "  " "  " "  "*"  " "  "*"  " "  "*"  "*"  "*"   "*"   "*"     
## 17  ( 1 ) " "  " "  " "  "*"  " "  "*"  "*"  "*"  "*"  "*"   "*"   "*"     
## 18  ( 1 ) "*"  " "  " "  "*"  " "  "*"  "*"  "*"  "*"  "*"   "*"   "*"     
## 19  ( 1 ) "*"  " "  "*"  "*"  " "  "*"  "*"  "*"  "*"  "*"   "*"   "*"     
## 20  ( 1 ) "*"  " "  "*"  "*"  "*"  "*"  "*"  "*"  "*"  "*"   "*"   "*"     
## 21  ( 1 ) "*"  "*"  "*"  "*"  "*"  "*"  "*"  "*"  "*"  "*"   "*"   "*"     
##           speechiness tempo valence
## 1  ( 1 )  " "         " "   " "    
## 2  ( 1 )  " "         " "   " "    
## 3  ( 1 )  " "         " "   " "    
## 4  ( 1 )  " "         " "   "*"    
## 5  ( 1 )  " "         " "   "*"    
## 6  ( 1 )  " "         " "   "*"    
## 7  ( 1 )  " "         " "   "*"    
## 8  ( 1 )  " "         " "   "*"    
## 9  ( 1 )  " "         " "   "*"    
## 10  ( 1 ) " "         " "   "*"    
## 11  ( 1 ) " "         " "   "*"    
## 12  ( 1 ) " "         " "   "*"    
## 13  ( 1 ) " "         "*"   "*"    
## 14  ( 1 ) "*"         "*"   "*"    
## 15  ( 1 ) "*"         "*"   "*"    
## 16  ( 1 ) "*"         "*"   "*"    
## 17  ( 1 ) "*"         "*"   "*"    
## 18  ( 1 ) "*"         "*"   "*"    
## 19  ( 1 ) "*"         "*"   "*"    
## 20  ( 1 ) "*"         "*"   "*"    
## 21  ( 1 ) "*"         "*"   "*"

Assess models

allreg.res0 <- summary(allreg.models0)
allreg.compare0 <- data.frame(model = c(1:21),
                              Adj.R2 = allreg.res0$adjr2,
                              CP = allreg.res0$cp)
allreg.compare0
##    model    Adj.R2        CP
## 1      1 0.3462533 360.15498
## 2      2 0.3591335 285.02115
## 3      3 0.3707186 217.57361
## 4      4 0.3838071 141.29567
## 5      5 0.3921850  92.84423
## 6      6 0.3971156  64.74779
## 7      7 0.4005655  45.39542
## 8      8 0.4025885  34.46326
## 9      9 0.4033295  31.09094
## 10    10 0.4041639  27.17060
## 11    11 0.4050357  23.03289
## 12    12 0.4061457  17.49622
## 13    13 0.4066369  15.60448
## 14    14 0.4071063  13.84283
## 15    15 0.4073589  13.35758
## 16    16 0.4074023  14.10287
## 17    17 0.4074617  14.75469
## 18    18 0.4073768  16.25484
## 19    19 0.4072293  18.12198
## 20    20 0.4070700  20.05820
## 21    21 0.4069096  22.00000

Model 15 has the smallest CP value, and its Adjusted R2 is essentially tied with the largest (models 16 and 17 are larger only in the fourth decimal place), so model number 15 is selected.

mlr.allreg0 <- lm(months ~ popularity + duration + acousticness + danceability + energy +
                    instrumentalness + key4 + key6 + key9 + key10 + key11 + loudness +
                    speechiness + tempo + valence, data = key.dummy(train0))
summary(mlr.allreg0)
## 
## Call:
## lm(formula = months ~ popularity + duration + acousticness + 
##     danceability + energy + instrumentalness + key4 + key6 + 
##     key9 + key10 + key11 + loudness + speechiness + tempo + valence, 
##     data = key.dummy(train0))
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -2.38943 -0.40057  0.04912  0.44462  1.61241 
## 
## Coefficients:
##                    Estimate Std. Error t value Pr(>|t|)    
## (Intercept)       2.7327797  0.1802230  15.163  < 2e-16 ***
## popularity       -0.0215505  0.0005940 -36.280  < 2e-16 ***
## duration          0.1933518  0.0198500   9.741  < 2e-16 ***
## acousticness     -0.2409479  0.0719685  -3.348 0.000823 ***
## danceability      0.4254183  0.1152193   3.692 0.000226 ***
## energy            0.5722443  0.1185570   4.827 1.45e-06 ***
## instrumentalness -0.8523074  0.1537756  -5.543 3.20e-08 ***
## key4              0.1245219  0.0351732   3.540 0.000405 ***
## key6              0.0567355  0.0359754   1.577 0.114871    
## key9              0.1150826  0.0371718   3.096 0.001977 ** 
## key10             0.1327266  0.0392897   3.378 0.000738 ***
## key11             0.1074954  0.0307223   3.499 0.000473 ***
## loudness         -0.0870517  0.0074155 -11.739  < 2e-16 ***
## speechiness      -0.3278520  0.1670495  -1.963 0.049772 *  
## tempo             0.0009343  0.0004402   2.122 0.033869 *  
## valence           0.3244776  0.0629704   5.153 2.71e-07 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.6085 on 3488 degrees of freedom
## Multiple R-squared:  0.4099, Adjusted R-squared:  0.4074 
## F-statistic: 161.5 on 15 and 3488 DF,  p-value: < 2.2e-16

prediction on test data

# prediction on test data
yhat.allreg0 = predict(mlr.allreg0, newdata=key.dummy(test0))
# RMSE for test data
# error.allreg0 <- yhat.allreg0 - test0$months
# rmse.allreg0 <- sqrt(mean(error.allreg0^2))
data.frame(
  RMSE = RMSE(yhat.allreg0, test0$months),
  R2 = R2(yhat.allreg0, test0$months)
)
##        RMSE        R2
## 1 0.5980156 0.3925419
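The commented-out lines above mirror what the RMSE() call computes. A minimal sketch of both metrics, assuming caret's defaults (where R2 is the squared Pearson correlation between predictions and observations, not 1 - SSE/SST):

```r
# Hand-rolled equivalents of the caret::RMSE() and caret::R2() calls used
# throughout this analysis.
rmse <- function(pred, obs) sqrt(mean((pred - obs)^2))
r2   <- function(pred, obs) cor(pred, obs)^2

# toy check on made-up values on the log(360 - months) scale
obs  <- c(4.7, 4.8, 2.7, 3.9, 4.1)
pred <- c(4.5, 4.9, 3.0, 3.7, 4.2)
round(c(RMSE = rmse(pred, obs), R2 = r2(pred, obs)), 4)
```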

Regularized Regression: Ridge

lm.ridge0 <- train(months ~. , data = train0, method = "glmnet",
                      trControl = train_control, 
                      tuneGrid = expand.grid(alpha = 0,lambda = seq(0.001,0.1,by = 0.001)))
#Best Tuning Parameter
lm.ridge0$bestTune
##    alpha lambda
## 45     0  0.045
# best coefficient
lm.ridge0.model <- coef(lm.ridge0$finalModel, lm.ridge0$bestTune$lambda)
lm.ridge0.model
## 22 x 1 sparse Matrix of class "dgCMatrix"
##                              1
## (Intercept)       2.8657735959
## popularity       -0.0205267725
## duration          0.1899716805
## acousticness     -0.2445830342
## danceability      0.4128163847
## energy            0.4823365034
## instrumentalness -0.7491778746
## key1             -0.0342147260
## key2             -0.0056979626
## key3              0.0075189682
## key4              0.0953873860
## key5             -0.0072177244
## key6              0.0225877475
## key7             -0.0627117017
## key8             -0.0749979230
## key9              0.0823586421
## key10             0.0986420788
## key11             0.0783989653
## loudness         -0.0798794742
## speechiness      -0.3084093591
## tempo             0.0008529497
## valence           0.3234950755

The Ridge Multiple Linear Regression model for mode 0 songs is written below with coefficients rounded to the thousandths. \[ log(360 - \hat{months}) = 2.866 - 0.021{popularity} + 0.190{duration} - 0.245{acousticness} \\ + 0.413{danceability} + 0.482{energy} - 0.749{instrumentalness} \\ - 0.034{key1} - 0.006{key2} + 0.008{key3} + 0.095{key4} - 0.007{key5} + 0.023{key6} \\ - 0.063{key7} - 0.075{key8} + 0.082{key9} + 0.099{key10} + 0.078{key11} \\ - 0.080{loudness} - 0.308{speechiness} + 0.001{tempo} + 0.323{valence} \]

# prediction on test data
yhat.ridge0 = predict(lm.ridge0, newdata = test0)  # caret applies bestTune automatically
# RMSE for test data
data.frame(
  RMSE = RMSE(yhat.ridge0, test0$months),
  R2 = R2(yhat.ridge0, test0$months)
)
##        RMSE        R2
## 1 0.5981751 0.3921513
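The RMSE values reported throughout are on the transformed log(360 - months) scale. To express a fitted value in months since January 1992, invert the transformation verified at the start of the analysis:

```r
# Predictions are on the log(360 - months) scale; invert the transform to
# recover months since January 1992.
back.months <- function(yhat) 360 - exp(yhat)

# round-trip check on hypothetical month counts
m    <- c(243, 240, 345, 309)
yhat <- log(360 - m)
back.months(yhat)  # recovers 243 240 345 309
```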

Regularized Regression: Lasso

lm.lasso0 <- train(months ~. , data = train0, method = "glmnet",
                      trControl = train_control, 
                      tuneGrid = expand.grid(alpha = 1,lambda = seq(0.001,0.1,by = 0.001)))
#Best Tuning Parameter
lm.lasso0$bestTune
##   alpha lambda
## 2     1  0.002
# best coefficient
lm.lasso0.model <- coef(lm.lasso0$finalModel, lm.lasso0$bestTune$lambda)
lm.lasso0.model
## 22 x 1 sparse Matrix of class "dgCMatrix"
##                              1
## (Intercept)       2.8365852962
## popularity       -0.0215614447
## duration          0.1908720137
## acousticness     -0.2395615122
## danceability      0.4088128971
## energy            0.5333749094
## instrumentalness -0.8078442972
## key1             -0.0270821553
## key2              .           
## key3              0.0040833394
## key4              0.0923273187
## key5              .           
## key6              0.0238454868
## key7             -0.0555632703
## key8             -0.0676519346
## key9              0.0819159291
## key10             0.0992058126
## key11             0.0770663223
## loudness         -0.0839496407
## speechiness      -0.2803564060
## tempo             0.0008109799
## valence           0.3212503863

The Lasso Multiple Linear Regression model for mode 0 songs is written below with coefficients rounded to the thousandths. \[ log(360 - \hat{months}) = 2.837 - 0.022{popularity} + 0.191{duration} - 0.240{acousticness} \\ + 0.409{danceability} + 0.533{energy} - 0.808{instrumentalness} \\ - 0.027{key1} + 0.004{key3} + 0.092{key4} + 0.024{key6} \\ - 0.056{key7} - 0.068{key8} + 0.082{key9} + 0.099{key10} + 0.077{key11} \\ - 0.084{loudness} - 0.280{speechiness} + 0.001{tempo} + 0.321{valence} \]

Compared to the full model, the lasso drops the dummy variables key2 and key5 (their coefficients are shrunk exactly to zero).

# prediction on test data
yhat.lasso0 = predict(lm.lasso0, newdata = test0)  # caret applies bestTune automatically
# RMSE for test data
error.lasso0 <- yhat.lasso0 - test0$months
rmse.lasso0 <- sqrt(mean(error.lasso0^2))
data.frame(
  RMSE = RMSE(yhat.lasso0, test0$months),
  R2 = R2(yhat.lasso0, test0$months)
)
##        RMSE        R2
## 1 0.5975859 0.3932411

Deciding on the best model

results0 <- data.frame(Model = c("FullMLR", "Stepwise", "AllSubsets","Ridge", "Lasso"),
                       RMSE = c(0.5976752 , 0.5980156, 0.5980156,0.5981751,0.5975859),
                       R2 = c(0.3932124, 0.3925419, 0.3925419,0.3921513,0.3932411))
results0
##        Model      RMSE        R2
## 1    FullMLR 0.5976752 0.3932124
## 2   Stepwise 0.5980156 0.3925419
## 3 AllSubsets 0.5980156 0.3925419
## 4      Ridge 0.5981751 0.3921513
## 5      Lasso 0.5975859 0.3932411

Lasso provides the best model for mode 0 songs, with both the lowest test RMSE and the highest test R2.

\[ log(360 - \hat{months}) = 2.837 - 0.022{popularity} + 0.191{duration} - 0.240{acousticness} \\ + 0.409{danceability} + 0.533{energy} - 0.808{instrumentalness} \\ - 0.027{key1} + 0.004{key3} + 0.092{key4} + 0.024{key6} \\ - 0.056{key7} - 0.068{key8} + 0.082{key9} + 0.099{key10} + 0.077{key11} \\ - 0.084{loudness} - 0.280{speechiness} + 0.001{tempo} + 0.321{valence} \]
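To see what this model says in practice, it can be applied to a single hypothetical mode 0 song. The coefficients below are the rounded lasso estimates from above; the feature values are invented for illustration, and the song is assumed to be in key 0 so that no key dummy enters:

```r
# Scoring one hypothetical mode 0 song with the final lasso model.
# The fitted value is on the log(360 - months) scale, so months = 360 - exp(yhat).
b <- c(`(Intercept)` = 2.837, popularity = -0.022, duration = 0.191,
       acousticness = -0.240, danceability = 0.409, energy = 0.533,
       instrumentalness = -0.808, loudness = -0.084, speechiness = -0.280,
       tempo = 0.001, valence = 0.321)  # key dummies omitted: song in key 0
x <- c(1, popularity = 55, duration = 3.5, acousticness = 0.15,
       danceability = 0.70, energy = 0.80, instrumentalness = 0,
       loudness = -4.5, speechiness = 0.08, tempo = 120, valence = 0.60)
yhat   <- sum(b * x)
months <- 360 - exp(yhat)
round(c(log.scale = yhat, months = months), 2)
```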

Multiple linear regression for mode 1 songs

Full Multiple Linear Regression

mlr1 <- lm(months ~. , data = train1)

assess model assumptions

plot(mlr1)

view model

summary(mlr1)
## 
## Call:
## lm(formula = months ~ ., data = train1)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -2.48440 -0.39951  0.05378  0.45331  1.76459 
## 
## Coefficients:
##                    Estimate Std. Error t value Pr(>|t|)    
## (Intercept)       2.9901418  0.1434374  20.846  < 2e-16 ***
## popularity       -0.0198841  0.0005122 -38.818  < 2e-16 ***
## duration          0.1251172  0.0155343   8.054 9.73e-16 ***
## acousticness     -0.1274822  0.0508609  -2.506  0.01222 *  
## danceability      0.6314340  0.0873034   7.233 5.39e-13 ***
## energy            0.3826541  0.0973081   3.932 8.51e-05 ***
## instrumentalness -0.7706272  0.1212973  -6.353 2.28e-10 ***
## key1              0.0842570  0.0314613   2.678  0.00743 ** 
## key2              0.0287030  0.0336488   0.853  0.39369    
## key3              0.0202750  0.0516321   0.393  0.69457    
## key4              0.0680017  0.0436575   1.558  0.11938    
## key5              0.0749496  0.0383254   1.956  0.05056 .  
## key6              0.0957015  0.0375012   2.552  0.01074 *  
## key7              0.1019746  0.0312840   3.260  0.00112 ** 
## key8              0.1189275  0.0362960   3.277  0.00106 ** 
## key9              0.0124663  0.0389359   0.320  0.74885    
## key10             0.0810947  0.0453378   1.789  0.07372 .  
## key11             0.0463297  0.0403927   1.147  0.25144    
## loudness         -0.0757560  0.0061732 -12.272  < 2e-16 ***
## speechiness      -0.1799930  0.1442184  -1.248  0.21206    
## tempo             0.0007324  0.0003286   2.229  0.02589 *  
## valence           0.2936440  0.0531944   5.520 3.54e-08 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.6245 on 5482 degrees of freedom
## Multiple R-squared:  0.3206, Adjusted R-squared:  0.318 
## F-statistic: 123.2 on 21 and 5482 DF,  p-value: < 2.2e-16

The Full Multiple Linear Regression model for mode 1 songs is written below with coefficients rounded to the thousandths. \[ log(360 - \hat{months}) = 2.990 - 0.020{popularity} + 0.125{duration} - 0.127{acousticness} \\ + 0.631{danceability} + 0.383{energy} - 0.771{instrumentalness} \\ + 0.084{key1} + 0.029{key2} + 0.020{key3} + 0.068{key4} + 0.075{key5} + 0.096{key6} \\ + 0.102{key7} + 0.119{key8} + 0.012{key9} + 0.081{key10} + 0.046{key11} \\ - 0.076{loudness} - 0.180{speechiness} + 0.001{tempo} + 0.294{valence} \]

# prediction on test data
yhat.mlr1 = predict(mlr1, newdata=test1)
# RMSE for test data
error.mlr1 <- yhat.mlr1 - test1$months
rmse.mlr1 <- sqrt(mean(error.mlr1^2))
data.frame(
  RMSE = RMSE(yhat.mlr1, test1$months),
  R2 = R2(yhat.mlr1, test1$months)
)
##        RMSE        R2
## 1 0.6078257 0.3312429

Variable Selection: Stepwise 10 fold Cross Validation

set.seed(123)
train_control <- trainControl(method = "cv", 
                              number = 10) 
bye <- capture.output(mlr.step_kcv1 <- train(months ~. , data = train1,  
                 method = "lmStepAIC", trControl = train_control)) 
print(mlr.step_kcv1)
## Linear Regression with Stepwise Selection 
## 
## 5504 samples
##   11 predictor
## 
## No pre-processing
## Resampling: Cross-Validated (10 fold) 
## Summary of sample sizes: 4953, 4953, 4955, 4954, 4953, 4954, ... 
## Resampling results:
## 
##   RMSE       Rsquared   MAE      
##   0.6266852  0.3145441  0.5035447
mlr.step_kcv1$finalModel
## 
## Call:
## lm(formula = .outcome ~ popularity + duration + acousticness + 
##     danceability + energy + instrumentalness + key1 + key5 + 
##     key6 + key7 + key8 + key10 + loudness + tempo + valence, 
##     data = dat)
## 
## Coefficients:
##      (Intercept)        popularity          duration      acousticness  
##        3.0299902        -0.0199787         0.1273096        -0.1239002  
##     danceability            energy  instrumentalness              key1  
##        0.6302868         0.3566622        -0.7653059         0.0616367  
##             key5              key6              key7              key8  
##        0.0529274         0.0732353         0.0786565         0.0959522  
##            key10          loudness             tempo           valence  
##        0.0597350        -0.0739348         0.0006867         0.2887687

The Multiple Linear Regression model resulting from stepwise 10 fold cross validation for mode 1 songs is written below with coefficients rounded to the thousandths. \[ log(360 - \hat{months}) = 3.030 - 0.020{popularity} + 0.127{duration} - 0.124{acousticness} \\ + 0.630{danceability} + 0.357{energy} - 0.765{instrumentalness} + 0.062{key1} \\ + 0.053{key5} + 0.073{key6} + 0.079{key7} + 0.096{key8} + 0.060{key10} \\ - 0.074{loudness} + 0.001{tempo} + 0.289{valence} \]

Compared to the full multiple linear regression model, the stepwise regression leaves out speechiness and the dummy variables key2, key3, key4, key9, and key11.

prediction on test data

# prediction on test data
yhat.step_kcv1 = predict(mlr.step_kcv1$finalModel, newdata=key.dummy(test1))
# RMSE for test data
error.step_kcv1 <- yhat.step_kcv1 - test1$months
rmse.step_kcv1 <- sqrt(mean(error.step_kcv1^2))
data.frame(
  RMSE = RMSE(yhat.step_kcv1, test1$months),
  R2 = R2(yhat.step_kcv1, test1$months)
)
##        RMSE        R2
## 1 0.6070693 0.3329448

Variable Selection: All Subsets Regression (Not using K-folds CV)

allreg.models1 <- regsubsets(months ~., data = train1, nvmax = 21)
summary(allreg.models1)
## Subset selection object
## Call: regsubsets.formula(months ~ ., data = train1, nvmax = 21)
## 21 Variables  (and intercept)
##                  Forced in Forced out
## popularity           FALSE      FALSE
## duration             FALSE      FALSE
## acousticness         FALSE      FALSE
## danceability         FALSE      FALSE
## energy               FALSE      FALSE
## instrumentalness     FALSE      FALSE
## key1                 FALSE      FALSE
## key2                 FALSE      FALSE
## key3                 FALSE      FALSE
## key4                 FALSE      FALSE
## key5                 FALSE      FALSE
## key6                 FALSE      FALSE
## key7                 FALSE      FALSE
## key8                 FALSE      FALSE
## key9                 FALSE      FALSE
## key10                FALSE      FALSE
## key11                FALSE      FALSE
## loudness             FALSE      FALSE
## speechiness          FALSE      FALSE
## tempo                FALSE      FALSE
## valence              FALSE      FALSE
## 1 subsets of each size up to 21
## Selection Algorithm: exhaustive
##           popularity duration acousticness danceability energy instrumentalness
## 1  ( 1 )  "*"        " "      " "          " "          " "    " "             
## 2  ( 1 )  "*"        " "      " "          "*"          " "    " "             
## 3  ( 1 )  "*"        " "      " "          " "          " "    " "             
## 4  ( 1 )  "*"        "*"      " "          " "          " "    " "             
## 5  ( 1 )  "*"        "*"      " "          "*"          "*"    " "             
## 6  ( 1 )  "*"        "*"      " "          "*"          "*"    "*"             
## 7  ( 1 )  "*"        "*"      " "          "*"          "*"    "*"             
## 8  ( 1 )  "*"        "*"      "*"          "*"          "*"    "*"             
## 9  ( 1 )  "*"        "*"      "*"          "*"          "*"    "*"             
## 10  ( 1 ) "*"        "*"      "*"          "*"          "*"    "*"             
## 11  ( 1 ) "*"        "*"      "*"          "*"          "*"    "*"             
## 12  ( 1 ) "*"        "*"      "*"          "*"          "*"    "*"             
## 13  ( 1 ) "*"        "*"      "*"          "*"          "*"    "*"             
## 14  ( 1 ) "*"        "*"      "*"          "*"          "*"    "*"             
## 15  ( 1 ) "*"        "*"      "*"          "*"          "*"    "*"             
## 16  ( 1 ) "*"        "*"      "*"          "*"          "*"    "*"             
## 17  ( 1 ) "*"        "*"      "*"          "*"          "*"    "*"             
## 18  ( 1 ) "*"        "*"      "*"          "*"          "*"    "*"             
## 19  ( 1 ) "*"        "*"      "*"          "*"          "*"    "*"             
## 20  ( 1 ) "*"        "*"      "*"          "*"          "*"    "*"             
## 21  ( 1 ) "*"        "*"      "*"          "*"          "*"    "*"             
##           key1 key2 key3 key4 key5 key6 key7 key8 key9 key10 key11 loudness
## 1  ( 1 )  " "  " "  " "  " "  " "  " "  " "  " "  " "  " "   " "   " "     
## 2  ( 1 )  " "  " "  " "  " "  " "  " "  " "  " "  " "  " "   " "   " "     
## 3  ( 1 )  " "  " "  " "  " "  " "  " "  " "  " "  " "  " "   " "   "*"     
## 4  ( 1 )  " "  " "  " "  " "  " "  " "  " "  " "  " "  " "   " "   "*"     
## 5  ( 1 )  " "  " "  " "  " "  " "  " "  " "  " "  " "  " "   " "   "*"     
## 6  ( 1 )  " "  " "  " "  " "  " "  " "  " "  " "  " "  " "   " "   "*"     
## 7  ( 1 )  " "  " "  " "  " "  " "  " "  " "  " "  " "  " "   " "   "*"     
## 8  ( 1 )  " "  " "  " "  " "  " "  " "  " "  " "  " "  " "   " "   "*"     
## 9  ( 1 )  " "  " "  " "  " "  " "  " "  " "  " "  " "  " "   " "   "*"     
## 10  ( 1 ) " "  " "  " "  " "  " "  " "  "*"  "*"  " "  " "   " "   "*"     
## 11  ( 1 ) " "  " "  " "  " "  " "  " "  "*"  "*"  " "  " "   " "   "*"     
## 12  ( 1 ) "*"  " "  " "  " "  " "  " "  "*"  "*"  " "  " "   " "   "*"     
## 13  ( 1 ) "*"  " "  " "  " "  " "  "*"  "*"  "*"  " "  " "   " "   "*"     
## 14  ( 1 ) "*"  " "  " "  " "  "*"  "*"  "*"  "*"  " "  " "   " "   "*"     
## 15  ( 1 ) "*"  " "  " "  " "  "*"  "*"  "*"  "*"  " "  "*"   " "   "*"     
## 16  ( 1 ) "*"  " "  " "  "*"  "*"  "*"  "*"  "*"  " "  "*"   " "   "*"     
## 17  ( 1 ) "*"  " "  " "  "*"  "*"  "*"  "*"  "*"  " "  "*"   " "   "*"     
## 18  ( 1 ) "*"  " "  " "  "*"  "*"  "*"  "*"  "*"  " "  "*"   "*"   "*"     
## 19  ( 1 ) "*"  "*"  " "  "*"  "*"  "*"  "*"  "*"  " "  "*"   "*"   "*"     
## 20  ( 1 ) "*"  "*"  "*"  "*"  "*"  "*"  "*"  "*"  " "  "*"   "*"   "*"     
## 21  ( 1 ) "*"  "*"  "*"  "*"  "*"  "*"  "*"  "*"  "*"  "*"   "*"   "*"     
##           speechiness tempo valence
## 1  ( 1 )  " "         " "   " "    
## 2  ( 1 )  " "         " "   " "    
## 3  ( 1 )  " "         " "   "*"    
## 4  ( 1 )  " "         " "   "*"    
## 5  ( 1 )  " "         " "   " "    
## 6  ( 1 )  " "         " "   " "    
## 7  ( 1 )  " "         " "   "*"    
## 8  ( 1 )  " "         " "   "*"    
## 9  ( 1 )  " "         "*"   "*"    
## 10  ( 1 ) " "         " "   "*"    
## 11  ( 1 ) " "         "*"   "*"    
## 12  ( 1 ) " "         "*"   "*"    
## 13  ( 1 ) " "         "*"   "*"    
## 14  ( 1 ) " "         "*"   "*"    
## 15  ( 1 ) " "         "*"   "*"    
## 16  ( 1 ) " "         "*"   "*"    
## 17  ( 1 ) "*"         "*"   "*"    
## 18  ( 1 ) "*"         "*"   "*"    
## 19  ( 1 ) "*"         "*"   "*"    
## 20  ( 1 ) "*"         "*"   "*"    
## 21  ( 1 ) "*"         "*"   "*"
allreg.res1 <- summary(allreg.models1)
allreg.compare1 <- data.frame(model = c(1:21),
                              Adj.R2 = allreg.res1$adjr2,
                              CP = allreg.res1$cp)
allreg.compare1
##    model    Adj.R2        CP
## 1      1 0.2662577 419.01033
## 2      2 0.2809405 301.51148
## 3      3 0.2915207 217.13918
## 4      4 0.2984165 162.50349
## 5      5 0.3064961  98.34476
## 6      6 0.3114590  59.32933
## 7      7 0.3153303  29.12450
## 8      8 0.3160799  24.08108
## 9      9 0.3164721  21.91925
## 10    10 0.3169039  19.43997
## 11    11 0.3173029  17.22495
## 12    12 0.3175376  16.33495
## 13    13 0.3178574  14.76005
## 14    14 0.3179760  14.80510
## 15    15 0.3181035  14.77934
## 16    16 0.3181764  15.19312
## 17    17 0.3182498  15.60317
## 18    18 0.3182303  16.76046
## 19    19 0.3181744  18.21054
## 20    20 0.3180635  20.10251
## 21    21 0.3179518  22.00000

Models 13 through 15 have nearly identical Cp values (model 13's 14.76 is technically the smallest) and adjusted R2 values. Model 14, which adds key5 to model 13, is fit below.
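A side note on the 22.00000 in the last row of both comparison tables: Mallows' Cp for the full model always equals the number of estimated parameters (here 21 predictors plus the intercept). A quick check on simulated data, assuming the common definition Cp = SSE_p/MSE_full - n + 2p:

```r
# For the full model, SSE_full / MSE_full = n - p, so Cp = (n - p) - n + 2p = p
# exactly -- which is why the last row of each comparison table is 22.00000.
set.seed(1)
n <- 200
X <- matrix(rnorm(n * 3), n, 3)
y <- X %*% c(1, -0.5, 0.2) + rnorm(n)
full <- lm(y ~ X)                      # p = 4 (intercept + 3 slopes)
mse  <- sum(resid(full)^2) / full$df.residual
cp   <- function(fit) sum(resid(fit)^2) / mse - n + 2 * (n - fit$df.residual)
c(Cp.full = cp(full))                  # exactly 4
```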

mlr.allreg1 <- lm(months ~ popularity + duration + acousticness + danceability + energy +
  instrumentalness + key1 + key5 + key6 + key7 + key8 + loudness +
  tempo + valence, data = key.dummy(train1))
summary(mlr.allreg1)
## 
## Call:
## lm(formula = months ~ popularity + duration + acousticness + 
##     danceability + energy + instrumentalness + key1 + key5 + 
##     key6 + key7 + key8 + loudness + tempo + valence, data = key.dummy(train1))
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -2.48754 -0.40040  0.05556  0.45256  1.83006 
## 
## Coefficients:
##                    Estimate Std. Error t value Pr(>|t|)    
## (Intercept)       3.0334496  0.1417347  21.402  < 2e-16 ***
## popularity       -0.0199882  0.0005048 -39.599  < 2e-16 ***
## duration          0.1277720  0.0153891   8.303  < 2e-16 ***
## acousticness     -0.1219629  0.0506774  -2.407 0.016132 *  
## danceability      0.6298047  0.0871300   7.228 5.56e-13 ***
## energy            0.3572279  0.0950323   3.759 0.000172 ***
## instrumentalness -0.7678735  0.1209540  -6.348 2.35e-10 ***
## key1              0.0567054  0.0264327   2.145 0.031974 *  
## key5              0.0478245  0.0342038   1.398 0.162103    
## key6              0.0682239  0.0333567   2.045 0.040874 *  
## key7              0.0738130  0.0261831   2.819 0.004833 ** 
## key8              0.0910011  0.0319686   2.847 0.004436 ** 
## loudness         -0.0740407  0.0060210 -12.297  < 2e-16 ***
## tempo             0.0006795  0.0003269   2.078 0.037717 *  
## valence           0.2887193  0.0530373   5.444 5.45e-08 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.6245 on 5489 degrees of freedom
## Multiple R-squared:  0.3197, Adjusted R-squared:  0.318 
## F-statistic: 184.3 on 14 and 5489 DF,  p-value: < 2.2e-16

prediction on test data

# prediction on test data
yhat.allreg1 = predict(mlr.allreg1, newdata=key.dummy(test1))
# RMSE for test data
data.frame(
  RMSE = RMSE(yhat.allreg1, test1$months),
  R2 = R2(yhat.allreg1, test1$months)
)
##        RMSE        R2
## 1 0.6071187 0.3328339

Regularized Regression: Ridge

lm.ridge1 <- train(months ~. , data = train1, method = "glmnet",
                      trControl = train_control, 
                      tuneGrid = expand.grid(alpha = 0,lambda = seq(0.001,0.1,by = 0.001)))
#Best Tuning Parameter
lm.ridge1$bestTune
##    alpha lambda
## 38     0  0.038
# best coefficient
lm.ridge1.model <- coef(lm.ridge1$finalModel, lm.ridge1$bestTune$lambda)
lm.ridge1.model
## 22 x 1 sparse Matrix of class "dgCMatrix"
##                              1
## (Intercept)       3.1431295345
## popularity       -0.0190324693
## duration          0.1221767160
## acousticness     -0.1418149731
## danceability      0.5970842302
## energy            0.2791295082
## instrumentalness -0.6904944050
## key1              0.0657555894
## key2              0.0117781483
## key3              0.0010085253
## key4              0.0500939827
## key5              0.0563973003
## key6              0.0773915151
## key7              0.0810143099
## key8              0.0962618966
## key9             -0.0050702623
## key10             0.0647627234
## key11             0.0284046850
## loudness         -0.0675138250
## speechiness      -0.1577203949
## tempo             0.0006447657
## valence           0.2966971890

The Ridge Multiple Linear Regression model for mode 1 songs is written below with coefficients rounded to the thousandths. \[ log(360 - \hat{months}) = 3.143 - 0.019{popularity} + 0.122{duration} - 0.142{acousticness} \\ + 0.597{danceability} + 0.279{energy} - 0.690{instrumentalness} \\ + 0.066{key1} + 0.012{key2} + 0.001{key3} + 0.050{key4} + 0.056{key5} + 0.077{key6} \\ + 0.081{key7} + 0.096{key8} - 0.005{key9} + 0.065{key10} + 0.028{key11} \\ - 0.068{loudness} - 0.158{speechiness} + 0.001{tempo} + 0.297{valence} \]

The ridge model retains every variable: no predictors are removed, but all coefficients are shrunk toward zero.
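This shrinkage effect can be seen in closed form for a single centered predictor, where the ridge estimate is sum(xy) / (sum(x^2) + lambda). A small simulated sketch:

```r
# One-predictor illustration of ridge shrinkage: as lambda grows, the
# closed-form ridge estimate moves monotonically toward zero, but never
# reaches it exactly (unlike the lasso).
set.seed(42)
x <- scale(rnorm(100), scale = FALSE)  # centered predictor
y <- 2 * x + rnorm(100)                # true slope of 2
ridge.beta <- function(lambda) sum(x * y) / (sum(x^2) + lambda)
round(sapply(c(0, 1, 10, 100), ridge.beta), 3)  # shrinks from ~2 toward 0
```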

# prediction on test data
yhat.ridge1 = predict(lm.ridge1, newdata = test1)  # caret applies bestTune automatically
# RMSE for test data
data.frame(
  RMSE = RMSE(yhat.ridge1, test1$months),
  R2 = R2(yhat.ridge1, test1$months)
)
##       RMSE       R2
## 1 0.607864 0.332307

Regularized Regression: Lasso

lm.lasso1 <- train(months ~. , data = train1, method = "glmnet",
                      trControl = train_control, 
                      tuneGrid = expand.grid(alpha = 1,lambda = seq(0.001,0.1,by = 0.001)))
#Best Tuning Parameter
lm.lasso1$bestTune
##   alpha lambda
## 1     1  0.001
# best coefficient
lm.lasso1.model <- coef(lm.lasso1$finalModel, lm.lasso1$bestTune$lambda)
lm.lasso1.model
## 22 x 1 sparse Matrix of class "dgCMatrix"
##                              1
## (Intercept)       3.0552331450
## popularity       -0.0198737952
## duration          0.1234290019
## acousticness     -0.1271194063
## danceability      0.6223931899
## energy            0.3536795015
## instrumentalness -0.7502129375
## key1              0.0663408994
## key2              0.0099288569
## key3              .           
## key4              0.0472801198
## key5              0.0554833191
## key6              0.0769102693
## key7              0.0835834150
## key8              0.0997023087
## key9              .           
## key10             0.0612161728
## key11             0.0269418262
## loudness         -0.0733775550
## speechiness      -0.1481066141
## tempo             0.0006778449
## valence           0.2918487107

The Lasso Multiple Linear Regression model for mode 1 songs is written below with coefficients rounded to the thousandths. \[ log(360 - \hat{months}) = 3.055 - 0.020{popularity} + 0.123{duration} - 0.127{acousticness} \\ + 0.622{danceability} + 0.354{energy} - 0.750{instrumentalness} \\ + 0.066{key1} + 0.010{key2} + 0.047{key4} + 0.055{key5} + 0.077{key6} \\ + 0.084{key7} + 0.100{key8} + 0.061{key10} + 0.027{key11} \\ - 0.073{loudness} - 0.148{speechiness} + 0.001{tempo} + 0.291{valence} \]

Compared to the full model, the lasso drops the dummy variables key3 and key9.

# prediction on test data
yhat.lasso1 = predict(lm.lasso1, newdata = test1)  # caret applies bestTune automatically
# RMSE for test data
error.lasso1 <- yhat.lasso1 - test1$months
rmse.lasso1 <- sqrt(mean(error.lasso1^2))
data.frame(
  RMSE = RMSE(yhat.lasso1, test1$months),
  R2 = R2(yhat.lasso1, test1$months)
)
##        RMSE        R2
## 1 0.6073883 0.3322821

Determining and interpreting the best model

results1 <- data.frame(Model = c("FullMLR", "Stepwise", "AllSubsets","Ridge", "Lasso"),
                       RMSE = c(0.6078257, 0.6070693,0.6071187,0.607864,0.6073883),
                       R2 = c(0.3312429, 0.3329448, 0.3328339,0.332307,0.3322821))
results1
##        Model      RMSE        R2
## 1    FullMLR 0.6078257 0.3312429
## 2   Stepwise 0.6070693 0.3329448
## 3 AllSubsets 0.6071187 0.3328339
## 4      Ridge 0.6078640 0.3323070
## 5      Lasso 0.6073883 0.3322821

Stepwise regression provides the best model for mode 1 songs, with both the lowest test RMSE and the highest test R2.

summary(mlr.step_kcv1$finalModel)
## 
## Call:
## lm(formula = .outcome ~ popularity + duration + acousticness + 
##     danceability + energy + instrumentalness + key1 + key5 + 
##     key6 + key7 + key8 + key10 + loudness + tempo + valence, 
##     data = dat)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -2.4842 -0.4012  0.0563  0.4534  1.7748 
## 
## Coefficients:
##                    Estimate Std. Error t value Pr(>|t|)    
## (Intercept)       3.0299902  0.1417423  21.377  < 2e-16 ***
## popularity       -0.0199787  0.0005048 -39.580  < 2e-16 ***
## duration          0.1273096  0.0153911   8.272  < 2e-16 ***
## acousticness     -0.1239002  0.0506910  -2.444 0.014548 *  
## danceability      0.6302868  0.0871225   7.234 5.31e-13 ***
## energy            0.3566622  0.0950243   3.753 0.000176 ***
## instrumentalness -0.7653059  0.1209561  -6.327 2.70e-10 ***
## key1              0.0616367  0.0266563   2.312 0.020799 *  
## key5              0.0529274  0.0343880   1.539 0.123832    
## key6              0.0732353  0.0335388   2.184 0.029034 *  
## key7              0.0786565  0.0264008   2.979 0.002902 ** 
## key8              0.0959522  0.0321543   2.984 0.002857 ** 
## key10             0.0597350  0.0419649   1.423 0.154663    
## loudness         -0.0739348  0.0060209 -12.280  < 2e-16 ***
## tempo             0.0006867  0.0003269   2.101 0.035717 *  
## valence           0.2887687  0.0530324   5.445 5.40e-08 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.6244 on 5488 degrees of freedom
## Multiple R-squared:   0.32,  Adjusted R-squared:  0.3181 
## F-statistic: 172.1 on 15 and 5488 DF,  p-value: < 2.2e-16

\[ log(360 - \hat{months}) = 3.030 - 0.020{popularity} + 0.127{duration} - 0.124{acousticness} \\ + 0.630{danceability} + 0.357{energy} - 0.765{instrumentalness} + 0.062{key1} \\ + 0.053{key5} + 0.073{key6} + 0.079{key7} + 0.096{key8} + 0.060{key10} \\ - 0.074{loudness} + 0.001{tempo} + 0.289{valence} \]
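As with the mode 0 model, this equation can be used to score a single hypothetical mode 1 song. The coefficients are the rounded stepwise estimates from the summary above; the feature values are invented for illustration, and the song is assumed to fall in a key whose dummy was dropped, so no key term enters:

```r
# Scoring one hypothetical mode 1 song with the stepwise model. The fitted
# value is on the log(360 - months) scale; months = 360 - exp(yhat) maps it
# back to months since January 1992.
b <- c(`(Intercept)` = 3.030, popularity = -0.020, duration = 0.127,
       acousticness = -0.124, danceability = 0.630, energy = 0.357,
       instrumentalness = -0.765, loudness = -0.074, tempo = 0.001,
       valence = 0.289)  # key dummies omitted for this hypothetical song
x <- c(1, popularity = 60, duration = 3.2, acousticness = 0.10,
       danceability = 0.75, energy = 0.85, instrumentalness = 0,
       loudness = -4.0, tempo = 125, valence = 0.65)
yhat <- sum(b * x)
round(c(log.scale = yhat, months = 360 - exp(yhat)), 2)
```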